SYSTEM



This invention relates to data processing systems. More particularly, this
invention relates to the selection of a performance level to be used by a data
processing system capable of operating at a plurality of different performance levels.

It is known to provide data processing systems capable of operating at a 
plurality of different performance levels. Typically, a relatively low performance and 
low power consumption performance level will be used when maximum processing 
performance is not required whereas when processing intensive operations are being
performed, then a higher performance level will be selected at the expense of
consuming more power. 

As an example of the type of processing systems capable of operating at
different performance levels, the processors produced by Intel Corporation
incorporating their SpeedStep technology operate in a high power, high speed mode
as well as one or more lower power, low speed modes. Switching between these
performance levels is typically carried out in dependence upon sensed external
parameters, such as whether the system is connected to a mains power supply
or a battery power supply.

It is also known to provide more dynamic performance level management
based upon the dynamically determined processing demands placed upon the data
processing system. An example of such an approach is the LongRun software control
of processor clock speed applied in the processors produced by Transmeta. Such
software attempts to reduce the processor clock frequency, and accordingly the power
consumed, when the processing demands being placed upon the processor are light
and to increase the clock frequency to obtain high performance when the processing
demands are greater.



A problem with this approach is that in order to ensure that the power saving
techniques do not interfere with the usability of the system, the software tends to
make safe assumptions regarding the desired performance level and to run the system
at a higher average performance level than is truly required. This wastes power.



... selecting a performance level to be used by a data processing apparatus capable
of operating at a plurality of different performance levels, said method comprising
the steps of:

calculating a plurality of performance requests using respective ones of a
plurality of performance request calculating algorithms;

combining said plurality of performance requests to form a global
performance request; and

selecting said performance level to be used by said data processing apparatus
from among said plurality of different performance levels in dependence upon
said global performance request.

... matches the performance level truly needed.



The performance request calculating algorithms are preferably independent of
each other ... data processing apparatus.

Preferred embodiments of the invention also allow the steps of calculating and
combining to be temporally independent of the step of selecting, and allow the
different performance request calculating algorithms to be temporally independent of
... performance request calculating algorithms can base their calculations and yet
not restrict the frequency or timing of when performance level changes may be made.

... combined in dependence upon the position within the hierarchy of their originating
performance request calculating algorithm.

It will be appreciated that the hierarchy may be fully ordered or partially
ordered. In the case of partial ordering an operator may be provided for combining
performance requests from the same hierarchy level (e.g. a maximum value selector).
Within the hierarchy model, performance requests originating from a performance
calculating algorithm with a more dominant position within the hierarchy than other
requests will be selected in preference to those other requests.

Improved flexibility may be achieved in the way that performance requests are
combined by providing that the performance requests are accompanied by commands
generated by the performance request calculating algorithms which specify how that
performance request should be combined with other performance requests. Thus a
performance request calculating algorithm can be considered to express a measure of
its confidence in the performance request it generates by specifying how that
performance request should be combined with other performance requests.

As preferred examples, the commands can specify that a performance request
should override any performance requests from a less dominant position in the
hierarchy, should be selected in preference to any lower performance level
performance request from a performance calculating algorithm in a less dominant
position, or should be ignored.

The combining of performance requests is conveniently and methodically
performed when these are combined starting from a performance request from a least
dominant position and working through to a performance request from a most dominant
position.



The performance request calculating algorithms may be provided in different
parts of the system, such as an operating system kernel, firmware of the data
processing apparatus or hardware within the data processing apparatus.

The performance request calculating algorithms can be responsive to a variety
of different ...

calculating logic operable to calculate a plurality of performance requests ...;

combining logic operable to combine said plurality of performance requests to
form a global performance request; and

selecting logic operable to select said performance level to be used by said
data processing apparatus from among said plurality of different performance levels
in dependence upon said global performance request.

Viewed from a further aspect the present invention provides a computer
program ... comprising:

calculating code operable to calculate ...;

... form a global performance request; and

selecting code operable to select said performance level to be used by said data
processing apparatus from among said plurality of different performance levels in
dependence upon said global performance request.

An example of the present invention will now be described, by way of 
example only, with reference to the accompanying drawings in which: 

Figure 1 schematically illustrates how a power management system according 
to the present technique may be implemented in a data processing system; 

Figure 2 schematically illustrates three hierarchical layers of the performance 
setting algorithm according to the present technique;

Figure 3 illustrates the strategy for setting the processor performance level 
during an interactive episode; 

Figure 4 schematically illustrates execution of a workload on the processor 
and calculation of the utilisation-history window for a task A; 

Figure 5 schematically illustrates an implementation of the three-layer 
hierarchical performance policy stack of Figure 2; 

Figure 6 schematically illustrates a work-tracking counter 600 according to the 
present technique; 

Figure 7 schematically illustrates an apparatus that is capable of providing a 
number of different fixed performance-levels in dependence upon workload 
characteristics; 

Figure 8 is a table that details simulation measurement results for a 
'plaympeg' video player playing a variety of MPEG videos;

Figure 9 is a table that lists processor performance level statistics during the
runs of each workload; 



... MPEG movies entitled 'Legendary' (Figure 10A) and ... (Figure 10B);

... algorithms tested on interactive workloads;



... kernel ... operating system. The data processing system comprises a
kernel 100 ... a scheduler 114 and a conventional power manager 116. An intelligent
energy manager system 120 is implemented in the kernel and comprises a policy
co-ordinator 122, a performance setting control module 124 and an event tracing
module 126. The user processes layer 130 comprises a system calls module 132, a task
management module 134 and application specific data 136. The user processes layer 130
supplies information to the kernel 100 via an application-monitoring module 140.

The kernel 100 is the core that provides basic services for other parts of the
operating system. The kernel can be contrasted with the shell, which is the
part of the operating system that interacts with user commands. ... program
interface known as system calls. ... The conventional power manager
116 manages the supply voltage by switching the processor between a power-
conserving sleep mode and a standard awake-mode in dependence upon the level of
processor utilisation.

The intelligent energy manager 120 is responsible for calculating and setting
processor performance targets. Rather than relying only on sleep mode for power
conservation, the intelligent energy manager 120 allows the central processing unit
(CPU) operating voltage and the processor clock frequency to be reduced without
causing the application software to miss process (i.e. task) deadlines. When the CPU
is running at full capacity many processing tasks will be completed in advance of their
deadlines, whereupon the processor will idle until the next scheduled task is begun.
An example of a task deadline for a task that produces data is the point at which the
produced data is required by another task. The deadline for an interactive task would
be the perception threshold of the user (50-100 ms). Going at full performance and
then idling is less energy-efficient than completing the task more slowly so that the
deadline is met more exactly. When the processor frequency is reduced, the voltage
may be scaled down in order to achieve energy savings. For processors implemented
in complementary metal-oxide semiconductor (CMOS) technology the energy used
for a given workload is proportional to the voltage squared. The policy co-ordinator
manages multiple performance-setting algorithms, each being appropriate to different
run-time situations. The most suitable performance-setting algorithm for a given
condition is selected at run-time. The performance setting control module 124
receives the results of each performance-setting algorithm and repeatedly calculates a
target processor performance by prioritising these results. The event tracing module
126 monitors system events both in the kernel 110 and in the user process layer 130
and feeds the information gathered to the performance setting control module 124 and
the policy co-ordinator 122.
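
The energy argument above can be illustrated with a minimal sketch. It assumes only the stated relation that energy for a given workload is proportional to the voltage squared; the two operating points, the 20 ms task and the function names are hypothetical values chosen for illustration, and idle-time energy is ignored.

```python
# Hypothetical operating points (not from the source):
# (fraction of peak speed, supply voltage in volts).
FULL_SPEED = (1.0, 1.6)
HALF_SPEED = (0.5, 1.2)


def run_time(work_at_full_speed_s: float, speed_fraction: float) -> float:
    """Wall-clock time needed, assuming run-time scales inversely with speed."""
    return work_at_full_speed_s / speed_fraction


def relative_energy(work_at_full_speed_s: float, voltage: float) -> float:
    """Energy for a fixed workload up to a constant factor: E ~ work * V^2."""
    return work_at_full_speed_s * voltage ** 2


work_s = 0.020      # task needs 20 ms at full speed (assumed)
deadline_s = 0.050  # lower end of the 50-100 ms perception threshold

slow_time = run_time(work_s, HALF_SPEED[0])
energy_ratio = (relative_energy(work_s, HALF_SPEED[1])
                / relative_energy(work_s, FULL_SPEED[1]))

print(f"half-speed run time: {slow_time * 1000:.0f} ms, "
      f"energy relative to full speed: {energy_ratio:.4f}")
```

Under these assumed numbers the task still meets its deadline at half speed while using a little over half the energy of the full-speed run.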

In the user processes layer, processing work is monitored via: system call
events 132; processing task events 134 including task switching, task creation and
task exit events; and via application-specific data. The intelligent energy manager
120 is implemented as a set of kernel modules and patches that hook into the standard
kernel functional modules and serve to control the speed and voltage levels of the
processor. The way in which the intelligent energy manager 120 is implemented
makes it relatively autonomous from other modules in the kernel 100. This has the
advantage of making the performance setting control mechanism less intrusive to the
host operating system. Implementation in the kernel also means that user application
programs need not be modified. Accordingly, the intelligent energy manager 120
co-exists with the system calls module, the scheduler 114 and the conventional
power manager 116 ... sub-systems. The intelligent energy manager 120 is used to
derive task deadlines and task classification information (e.g. whether the task is
associated with an interactive application) from the OS kernel by examining the
communication patterns between ... 130 serves to monitor which system calls are
accessed by each task and how data flows between the communication structures in
the kernel.

Figure 2 schematically illustrates three hierarchical layers of the performance
setting algorithm according to the present technique. It should be noted that on a
given processor the frequency/voltage setting options are typically ...
a fixed set of predetermined values. Whereas known techniques of calculating a
target processor performance level involve use of a single performance-setting
algorithm, the present technique utilises multiple algorithms, each of which has
different characteristics appropriate to different run-time situations. The most
applicable algorithm to a given processing situation is selected at run-time. The
policy co-ordinator module 122 co-ordinates the performance setting algorithms and,
by connecting to hooks in the standard kernel 110, provides shared functionality to
multiple performance setting algorithms.

The results of the multiple performance-setting algorithms are collated and
analysed to determine a global estimate for a target processor performance level.
The various algorithms are organised into a decision hierarchy (or algorithm stack)
in which the performance level indicators output by algorithms at upper (more
dominant) levels of the hierarchy have the right to override the performance level
indicators output by algorithms at lower (less dominant) levels of the hierarchy.
The example embodiment of Figure 2 has three hierarchical levels. At the uppermost
level of the hierarchy there is an interactive application performance indicator
210, at the middle level there is an application-specific performance
indicator 220 and at the lowermost level of the hierarchy there is a task-based
processor utilisation performance indicator 230.
The interactive application performance indicator 210 calculation is
performed by an algorithm based on that described in Flautner et al., "Automatic
Performance-setting for Dynamic Voltage Scaling", Proceedings of the International
Conference on Mobile Computing and Networking, July 2001. The interactive
application performance-level prediction algorithm seeks to guarantee good
interactive performance by finding the periods of execution that directly impact the
user experience and ensuring that these episodes complete without undue delay. The
algorithm uses a relatively simple technique for automatically isolating interactive
episodes. This technique relies on monitoring communication from the X server,
which is the GUI controller, and tracking the execution of the tasks that get triggered
as a result.



The beginning of an interactive episode (which typically comprises a
multiplicity of tasks) is initiated by the user and is signified by a GUI event, such as
pressing a mouse button or a key on the keyboard. As a result of this event, the GUI
controller (X server in this case) dispatches a message to the task that is responsible
for handling the event. By monitoring the appropriate system calls (various versions
of read, write, and select), the intelligent energy manager 120 can automatically detect
the beginning of an interactive episode. When the episode starts, both the GUI
controller and the task that is the receiver of the message are marked as being in an
interactive episode. If tasks of an interactive episode communicate with unmarked
tasks, then the as yet unmarked tasks are also marked. During this process, the
intelligent energy manager 120 keeps track of how many of the marked tasks have
been pre-empted. The end of the episode is reached when the number of pre-empted
tasks is zero, indicating that all tasks have run to completion.
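
The episode bookkeeping described above can be sketched as follows. This is a simplified illustration rather than the kernel implementation: the class and method names are hypothetical, a single processor is assumed, and the end-of-episode check is assumed to happen when a marked task voluntarily gives up the processor.

```python
class EpisodeTracker:
    """Tracks one interactive episode on a single processor (simplifying assumption)."""

    def __init__(self) -> None:
        self.marked = set()     # tasks currently part of the episode
        self.preempted = set()  # marked tasks that were pre-empted
        self.active = False

    def on_gui_message(self, gui_task: str, receiver: str) -> None:
        """GUI controller dispatches a message: the episode begins."""
        self.active = True
        self.marked.update({gui_task, receiver})

    def on_communicate(self, sender: str, receiver: str) -> None:
        """A marked task talking to an unmarked task marks that task too."""
        if self.active and sender in self.marked:
            self.marked.add(receiver)

    def on_preempt(self, task: str) -> None:
        if task in self.marked:
            self.preempted.add(task)

    def on_resume(self, task: str) -> None:
        self.preempted.discard(task)

    def on_yield(self, task: str) -> bool:
        """A task gives up the processor; returns True if the episode just ended."""
        if self.active and task in self.marked and not self.preempted:
            self.active = False
            self.marked.clear()
            return True
        return False


tracker = EpisodeTracker()
tracker.on_gui_message("X", "app")
tracker.on_communicate("app", "helper")
tracker.on_preempt("app")
still_running = tracker.on_yield("helper")  # "app" is pre-empted, so no end yet
tracker.on_resume("app")
ended = tracker.on_yield("app")             # no pre-empted tasks remain
```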

Figure 3 illustrates the strategy for setting the processor performance level
during an interactive episode. The duration of an interactive episode is known to vary
by several orders of magnitude (from around lO^s up to around 1 second). However 
a transition-start latency or 'skip threshold' of 5 milliseconds is set to filter out the
shortest interactive episodes, thereby reducing the number of requested performance-
level transitions. The sub-millisecond interactive episodes are typically the results of
echoing key presses to the window or moving the mouse across the screen and



... Since this ... without inversely impacting the worst case.

... is specified to be the maximum performance level. Since this is a top-level
... At the end of the interactive episode the algorithm computes what the correct
performance level for the episode should have been and this corrected value is
incorporated into an exponentially decaying average so that it ...
optimisation is performed if the panic threshold was reached
during an interactive episode: the moving average is recalculated so that ...
performance level is incorporated into the exponentially decaying average
with a higher weight (k = ... is used instead of k = 3). The performance prediction is
computed for all episodes that are longer than the skip threshold ...

Interactive episode 'deadlines' are used to obtain a performance level
for each identified interactive episode. The deadline is ... performance ...
The performance level indicator for interactive episodes is calculated in dependence
upon the human perception threshold associated with the particular interactive event.
For example, it is known that a rate of 20 to 30 frames per second is fast enough for the
user to perceive a series of images as a continuous stream so that the perception
threshold could be set to 50 ms for an interactive image display episode. Although the
exact value of the perception threshold is dependent on the user and the type of task
being accomplished, a fixed value of 50 ms was found to be adequate for the
interactive algorithm of the hierarchy. The equation below is used for computing the
performance requirements of episodes that are shorter than the perception threshold:



$$\mathrm{Perf} = \frac{\mathrm{Work}_{fse}}{\text{Perception Threshold}}$$

where the full-speed equivalent work Work_fse is measured from the beginning
of the interactive episode.
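
As a worked illustration of the perception-threshold calculation, the sketch below assumes the equation reads Perf = Work_fse / Perception Threshold, with Perf expressed as a fraction of peak performance. The function name, the clamp to 1.0 and the 20 ms example are assumptions added for illustration; only the 50 ms threshold comes from the text.

```python
PERCEPTION_THRESHOLD_S = 0.050  # fixed 50 ms value reported in the text


def interactive_perf(work_fse_s: float,
                     threshold_s: float = PERCEPTION_THRESHOLD_S) -> float:
    """Fraction of peak performance needed to finish the episode's
    full-speed-equivalent work exactly at the perception threshold.

    The clamp to 1.0 is an assumption: a processor cannot be asked to
    run faster than its peak level.
    """
    if work_fse_s < 0 or threshold_s <= 0:
        raise ValueError("work must be non-negative and threshold positive")
    return min(1.0, work_fse_s / threshold_s)


# An episode needing 20 ms of full-speed work can run at 40% of peak.
example_perf = interactive_perf(0.020)
```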

The application-specific performance indicator 220 of the middle hierarchical
layer is obtained by collating information output by a category of application
programs that are aware of performance level setting functionality. These
application programs have been adapted to submit (via system calls) specific
information to the intelligent energy manager 120 about their specific performance
requirements. The operating system and application programs can be provided with
new API elements to facilitate this communication regarding performance requirements.

The perspectives-based performance indicator 230 is obtained by
implementing a perspectives-based algorithm that estimates future utilisation of the
processor based on the recent utilisation history. This algorithm derives a utilisation
estimate for each individual task and adjusts the size of the time period over which
the utilisation-history is calculated (i.e. the utilisation-history window) on a task by
task basis. The perspectives-based algorithm takes account of all categories of task
being performed by the processor whereas the interactive application algorithm of the
uppermost layer takes account of interactive tasks. Since the interactive application
algorithm calculates a performance-level indicator that aims to guarantee a high
quality of interactive performance and it is situated at the uppermost level of the
hierarchy, the perspectives-based algorithm need not be constrained to a
conservatively short utilisation-history window. The possibility of using longer
utilisation-history windows at this lowermost hierarchical level allows for improved
efficiency since a more aggressive power reduction strategy can be selected when
appropriate. If the utilisation-history window is too short, this can cause the
performance-level predictions to oscillate rapidly between two fixed values. It is
typically necessary to set a short utilisation-history window where a single unified
algorithm (rather than a hierarchical set of algorithms) is used to set the performance
level for all run-time circumstances. To be able to cope with intermittent processor-
intensive interactive events, such unified algorithms must keep the utilisation-history
window short.
Each of the performance-setting algorithms of the 3-layer stack uses a measure
of processing work-done in a given time interval. In this embodiment, the work-done
measure that is used is the full-speed (of the processor) equivalent work Work_fse
performed in that time interval. This full-speed equivalent work estimate is calculated
according to the following formula:

$$\mathrm{Work}_{fse} = \sum_{i=1}^{n} t_i \, p_i$$

where n is the number of different processor performance levels implemented during
the given time interval; t_i is the non-idle time in seconds spent at performance level i;
and p_i is the processor performance level i expressed as a fraction of the peak (full-
speed) processor performance level. This equation is valid on a system in which a
time-stamp counter (work counter) measures real time. The work-done would be
calculated differently in alternative embodiments that use cycle counters whose count
rate varies according to the current processor frequency. Furthermore, the above
equation makes the implicit assumption that the run-time of a workload is inversely
proportional to the processor frequency. This assumption provides a reasonable
estimate of work-done. However, primarily due to the non-linearity of bus speed to
processor speed ratios during performance scaling, the assumption is not always
accurate. In alternative embodiments the work-done calculation can be fine-tuned to
take account of such factors.
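
The full-speed equivalent work measure can be sketched in a few lines, assuming it is the sum over performance levels of non-idle time multiplied by the level's fraction of peak performance, as the symbol definitions above indicate. The function name and the sample intervals are hypothetical.

```python
from typing import Iterable, Tuple


def full_speed_equivalent_work(intervals: Iterable[Tuple[float, float]]) -> float:
    """Sum of t_i * p_i over the performance levels used in a time interval.

    Each tuple is (t_i, p_i): non-idle seconds spent at a level, and that
    level expressed as a fraction of peak performance (0 < p_i <= 1).
    """
    total = 0.0
    for non_idle_s, fraction_of_peak in intervals:
        if non_idle_s < 0 or not 0.0 < fraction_of_peak <= 1.0:
            raise ValueError("invalid (time, performance fraction) pair")
        total += non_idle_s * fraction_of_peak
    return total


# 10 ms at full speed plus 20 ms at half speed is 20 ms of full-speed work.
work_fse = full_speed_equivalent_work([(0.010, 1.0), (0.020, 0.5)])
```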



Figure 4 schematically illustrates execution of a workload on the processor
and calculation of the utilisation-history window for a task A. The horizontal axis of
Figure 4 represents time. Task A first starts execution at time S, whereupon a number
of per-task data structures are initialised. There are four of these data structures
corresponding to the following four pieces of information: (i) the current state of the
work counter; (ii) the current (real) time; (iii) the current state of an idle-time counter;
and (iv) a run bit is set to logical level '1' indicating that the task has started running.
The work counter, the real-time counter and the idle-time counter are used to
calculate the processor utilisation associated with task A and subsequently to calculate
task A's performance requirements. At time PE, task A has not yet run to completion
but is pre-empted by another task, task B. Pre-empting will occur when the task
scheduler 114 determines that another task has higher priority than the task that is
currently running. When task A is pre-empted the run bit is maintained at logical level
'1' to indicate that the task still has work to complete. At time RE, task A resumes
execution, having been rescheduled, and continues to execute until it has run to
completion at time TC, whereupon it voluntarily gives up processing time. On
completion task A may initiate a system call that yields the processor to another task.
On completion of task A at time TC the run bit is reset to logical level '0'.

After time TC, there is an idle period followed by execution of a further task C
and a subsequent idle period. At time RS, task A begins execution for a second time.
At time RS, the '0' state of the run bit associated with task A indicates that
information exists to enable calculation of task A's performance requirements so that
the processor target performance level can be set accordingly for the imminent re-
execution of task A. The utilisation-history window for a given task is defined to be
the period of time from the start of the first execution of the given task to the start of
the subsequent execution of the given task and should include at least one pre-empting
event (task A is pre-empted by task B at point PE in this case) of the given task within
the relevant window. Accordingly, in this case the utilisation-history window for task
A is defined to be the time period from time S to time RS. The target performance
level for task A in this window is calculated as follows:
$$\mathrm{WorkEst}_{new} = \frac{k \times \mathrm{WorkEst}_{old} + \mathrm{Work}_{fse}}{k + 1}$$

$$\mathrm{Deadline}_{new} = \frac{k \times \mathrm{Deadline}_{old} + (\mathrm{Work}_{fse} + \mathrm{Idle})}{k + 1}$$

where k is a weighting factor and Idle is the idle ...
between time points TC and RS is the "slack" that ...
estimates. The weighting factor k is ...
prediction. The performance level indicator Perf ... for this
algorithm is given by the ratio of the ...
technique, the work estimates WorkEst for ...
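
The two update equations for the perspectives-based algorithm can be sketched as below. The weighted-average updates follow the equations in the text; reading the performance indicator as WorkEst divided by Deadline is an assumption based on the partly illegible statement that it is "given by the ratio". The class name, function names, k = 3 and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    work_est: float  # estimated full-speed-equivalent work, seconds
    deadline: float  # estimated time available (work plus slack), seconds


def update_estimate(old: TaskEstimate, work_fse: float, idle: float,
                    k: float) -> TaskEstimate:
    """Apply the weighted-average updates over one utilisation-history window."""
    new_work = (k * old.work_est + work_fse) / (k + 1)
    new_deadline = (k * old.deadline + (work_fse + idle)) / (k + 1)
    return TaskEstimate(new_work, new_deadline)


def perf_indicator(est: TaskEstimate) -> float:
    """Assumed reading of the garbled text: Perf = WorkEst / Deadline."""
    return est.work_est / est.deadline


old = TaskEstimate(work_est=0.010, deadline=0.040)
new = update_estimate(old, work_fse=0.030, idle=0.010, k=3.0)
perf = perf_indicator(new)
```

With these sample values the work estimate rises to 15 ms against an unchanged 40 ms deadline, so the requested level rises from 25% to 37.5% of peak.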
According to the perspectives-based algorithm of the present technique it is
necessary to avoid a situation occurring whereby a new non-interactive CPU bound
task utilises the processor for an extensive period without being pre-empted. This
could introduce substantial latency in adaptation of the performance level to the task
since the utilisation-history window can only be defined once the task has been pre-
empted at least once. To avoid unwanted performance adaptation latency an upper
threshold is set for the non pre-empted duration over which the work estimate is
calculated. In particular, if a task continues without being pre-empted for 100
milliseconds, then its work estimate is recalculated by default. The value of 100
milliseconds was selected by taking into account that a more stringent application-
history window is ensured for interactive applications via the dominant hierarchical
layer 210, which produces a separate interactive application performance indicator. It
was also considered that the only class of user applications likely to be affected by the
100 millisecond window threshold are computationally intensive batch jobs such as
compilation, which are likely to run for several seconds or even minutes. In such
cases an extra 100 milliseconds (0.1 seconds) of run time is unlikely to be significant
performance-wise.



Figure 5 schematically illustrates an implementation of the three-layer
hierarchical performance policy stack of Figure 2. The implementation comprises a
performance indicator policy stack 510 and a policy event handler 530, each of which
outputs information to a target performance calculator 540. The target performance
calculator 540 serves to collate the results from four performance-setting algorithms:
a top level interactive algorithm, a middle level application-based algorithm and two
different lower level algorithms. The four algorithms are capable of being run
concurrently. The target performance calculator 540 derives a single global target
performance level from the multiple performance indicators (in this case four)
produced by the policy stack 510. The policy stack 510 together with the policy event
handler 530 and the target performance calculator 540 provides a flexible framework
for multiple performance-setting policies so that policy algorithms of each level of the
stack can be replaced or interchanged as desired by the user. Accordingly the
performance policy stack provides a platform for experimentation in which user-
customised performance-setting policies can be incorporated.
peak level. The target performance calculator uses an operator to combine these two
equal-priority requests, in this case preferentially selecting the 55% value as the level
0 performance indicator. At level 2, the command 'SET IF GREATER THAN' has
been specified together with a performance indicator of 80%. The 'SET IF
GREATER THAN' command provides that the target performance calculator 540
should set the global target performance level to be 80% provided that this is greater
than any of the performance indicators from lower hierarchical levels. In this case the
level 0 performance indicator is 55% and the level 1 performance indicator is to be
disregarded so that the global target will indeed be set to 80% of peak performance.
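
The combination of requests in the worked example above can be sketched as a bottom-up walk over the policy stack. Only the 'SET IF GREATER THAN' command name appears in the text; the names SET and IGNORE, the maximum-value operator for equal-priority requests, the 40% second request at level 0 and the full-speed default when no request applies are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence


class Command(Enum):
    SET = "SET"                                  # assumed name: override lower levels
    SET_IF_GREATER_THAN = "SET IF GREATER THAN"  # named in the text
    IGNORE = "IGNORE"                            # assumed name: disregard request


@dataclass
class Request:
    command: Command
    perf: float  # percentage of peak performance


def combine_same_level(requests: Sequence[Request]) -> Optional[Request]:
    """Equal-priority requests: pick the highest (assumed maximum-value selector)."""
    active = [r for r in requests if r.command is not Command.IGNORE]
    return max(active, key=lambda r: r.perf) if active else None


def global_target(stack: List[Sequence[Request]], default: float = 100.0) -> float:
    """Evaluate from the least dominant level (index 0) to the most dominant."""
    target: Optional[float] = None
    for level in stack:
        request = combine_same_level(level)
        if request is None:
            continue  # every request at this level is ignored
        if request.command is Command.SET:
            target = request.perf
        elif target is None or request.perf > target:
            target = request.perf  # SET IF GREATER THAN
    return default if target is None else target


stack = [
    [Request(Command.SET, 40.0), Request(Command.SET, 55.0)],  # level 0
    [Request(Command.IGNORE, 0.0)],                            # level 1
    [Request(Command.SET_IF_GREATER_THAN, 80.0)],              # level 2
]
result = global_target(stack)  # 55% after level 0, unchanged at level 1, 80% at level 2
```

Because each level's latest request is simply data in the stack, the global target can be recomputed at any time without re-running the algorithms that produced the requests.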

Since the most recently calculated performance level indicators for each
algorithm are stored in memory by the policy stack 510, the target performance
calculator 540 can calculate a new global target value at any time without having to
invoke each and every performance-setting algorithm. When a new performance level
request is calculated by one of the algorithms on the stack, the target performance
calculator will evaluate the contents of the command-performance data structures
from the bottom level up to compute an updated global target performance level.
Accordingly in the example of Figure 5, at level 0 the global prediction is set to 55%,
at level 1 it remains at 55% and at level 2 the global prediction changes to 80%.
Although each of the performance-setting algorithms can be triggered (by a
processing event in the system) to calculate a new performance level at any time, there
is a set of common events to which all of the performance-setting algorithms will
tend to respond. These events are monitored and flagged by the policy event handler
530, which provides policy event information to the target performance calculator
540. This special category of events comprises reset events 532, task switch events
534 and performance change events 536. The performance change event 536 is a
notification that alerts each performance setting algorithm to the current performance
level of the processor, although it does not usually alter the performance requests on
the policy stack 510. For this special category of policy events 532, 534, 536, the
global target level is not recomputed each time one of the algorithms issues an
updated performance-level indicator. Rather, the target performance level calculation
is co-ordinated so that the calculation is performed once only for each event
notification after all event handlers of all interested performance setting algorithms
have been invoked.
Table 1 lists measurement data that gives a percentage discrepancy between
expected run-time duration and an actual run-time duration for both a CPU bound
loop and for an MPEG video workload when considering a performance-level
transition between two different processor speeds (from a higher to a lower speed in
this case). The results are based on post-transition runs at three distinct processor
performance levels: 300, 400, and 500 MHz (as specified in the left-most column of
the table). The top row of Table 1 lists the initial performance level from which the
transition to the corresponding processor speed in the left-most column was made.
On the CPU bound loop, the difference between the predicted and actual
measurements is indistinguishable from the noise, whereas for the MPEG workload
there is about a 6%-7% inaccuracy penalty per 100 MHz step in processor frequency.
The maximum inaccuracy on these workloads is seen to be less than 20%
(19.4%), which is considered to be acceptable for a system with only a few fixed
performance levels. However, as the available range of minimum to maximum
processor performance levels that are selectable in a system increases and the range of
each performance-level step decreases, it is likely that a more accurate work-
estimator than the processor speed will be required.



                         Initial speed
Post-transition speed    400 MHz    500 MHz    600 MHz

CPU BOUND LOOP
300 MHz                  -0.3%      -0.4%      -0.3%
400 MHz                             -0.1%       0.0%
500 MHz                                         0.1%

MPEG VIDEO WORKLOAD
300 MHz                   7.1%      13.5%      19.4%
400 MHz                              6.9%      13.3%
500 MHz                                         6.8%



The more sophisticated algorithm of alternative example embodiments uses a 
more accurate work-done estimation technique that involves monitoring the 
instruction profile (via counters that keep track of significant events such as memory 
accesses) and the expected and actual decrease rate of the workload, rather than 
making the assumption that the work-done is directly proportional to the processor 
speed. Further alternative embodiments use cache hit-rates and memory-system 
performance indicators to refine the work-done estimate. Yet further alternative 
example embodiments use software to monitor the percentage of processing time used 
in executing a programming application (equated to useful work-done) relative to the 
percentage of processing time used in performing background operating-system tasks. 

The hardware control module 630 is capable of estimating work-done even
during transition periods when the processor is in the process of switching between
two fixed performance levels. For each processor performance transition there may
be a pause of around 20 microseconds during which the processor does not issue any
instructions. This pause is due to the time needed to resynchronise the phase-locked
loops to the new target processor frequency. Furthermore, before the processor
frequency can be changed, the voltage must be stabilised to an appropriate value for
the new target frequency. Accordingly, there is a transition time of up to 1
millisecond, during which it can be assumed that the processor is running at the old
target frequency but energy is being consumed at the new target level (since the
voltage has been set to the new target level). The frequency may be ramped up in
several stages via intermediate frequency steps to effect the performance-level
change. During such transition periods, when the frequency of the processor is
changing dynamically, the hardware control module 630 is operable to update the
increment value register taking account of the dynamic changes of which the software
is unaware. Although this example embodiment makes use of both hardware and
software control modules 620, 630 to calculate the work done, alternative example
embodiments may use only one of these two modules to estimate the work-done.

The accumulator(s) module 640 periodically reads the increment value from
the increment value register 610 and adds the increment value to an accumulated sum
stored in the work-count value register. The work-count value register increments the
work-count value every clock-tick. The clock-tick is a time signal derived from the
real-time clock 650. To measure the work-done during a predetermined time-interval,
the work-count value stored in the accumulator(s) module 640 is read twice: once at
the beginning of the predetermined time interval and once at the end. The difference
between these two values provides an indication of the work-done during the
predetermined time interval.
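The read-twice measurement can be illustrated with a small software model. This is a sketch only: the class and method names are hypothetical, and the increment is assumed to be proportional to the clock frequency in MHz, which is just one possible work estimate.

```python
class WorkCounter:
    """Toy model of the work-count accumulator (hypothetical names)."""

    def __init__(self, increment=0):
        self.increment = increment   # models the increment value register
        self.work_count = 0          # models the work-count value register

    def set_increment(self, value):
        # Written by a control module when the performance level changes.
        self.increment = value

    def tick(self, count=1):
        # Each real-time clock tick adds the current increment value.
        self.work_count += self.increment * count

    def read(self):
        return self.work_count


counter = WorkCounter(increment=600)   # assumed: increment tracks 600 MHz
start = counter.read()                 # first read, at the start of the interval
counter.tick(10)                       # ten ticks at the higher level
counter.set_increment(300)             # performance level is lowered
counter.tick(10)                       # ten ticks at the lower level
work_done = counter.read() - start     # second read, at the end of the interval
print(work_done)
```

Because the increment changes with the performance level, the difference between the two reads reflects the mix of levels used during the interval rather than just its length.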



The real-time clock 650 also controls the rate at which the time-count value
stored in the register 644 is incremented. This time-count value register works on the
same time base as the work-count value but is used to measure time elapsed rather
than work done. Having both a time counter and a work-done counter facilitates
performance-setting algorithms. The time-base register 646 is provided for the
purpose of multi-platform compatibility and conversion to seconds. It serves to
specify the time base (frequency) of the two counters 642, 644 so that time can be
accurately and consistently measured, i.e. the accumulated value stored in the time count
value register provides an indication of the time elapsed in milliseconds. The control
register module 660 comprises two control registers, one for each counter. A
counter can be enabled, disabled or reset via the appropriate control register.
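A minimal sketch of how a time-base value and a per-counter control register might be used together is given below. The 32,768 Hz time base, the command strings and all names are assumptions made for illustration and are not taken from the described hardware.

```python
TIME_BASE_HZ = 32_768  # hypothetical value held in the time-base register


class Counter:
    """A counter governed by a simple control register (illustrative only)."""

    def __init__(self):
        self.value = 0
        self.enabled = False

    def control(self, command):
        # Models a write to the control register of this counter.
        if command == "enable":
            self.enabled = True
        elif command == "disable":
            self.enabled = False
        elif command == "reset":
            self.value = 0
        else:
            raise ValueError(f"unknown command: {command}")

    def tick(self, increment=1, count=1):
        # Ticks are ignored while the counter is disabled.
        if self.enabled:
            self.value += increment * count


def ticks_to_ms(ticks, time_base_hz=TIME_BASE_HZ):
    """Convert a tick count to milliseconds using the time base."""
    return ticks * 1000.0 / time_base_hz


time_counter = Counter()
time_counter.control("enable")
time_counter.tick(count=16_384)     # half a second of ticks at the assumed time base
time_counter.control("disable")
time_counter.tick(count=1_000)      # ignored while disabled
elapsed_ms = ticks_to_ms(time_counter.value)
print(elapsed_ms)
```

Keeping the time base in one place means that code reading the counters does not need to know the tick frequency of a particular platform.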


Figure 7 schematically illustrates an apparatus that is capable of providing a
number of different fixed performance-levels in dependence upon workload
characteristics. The apparatus comprises a CPU 710, a real-time clock 720, a power
supply control module 730 and the increment value register 610 of the work-tracking
counter of Figure 6. The power supply control module 730 determines which of the
fixed performance-levels the CPU is currently set to run at and selects an appropriate
clock frequency for the real-time clock 720. The power supply control module 730
inputs information on the current processor frequency to the increment value register
610. Accordingly, the value of the increment is proportional to the processor
frequency, which in turn provides an estimate of useful work-done by the processor.

Many of the performance-setting algorithms of the policy stack 510 use the
utilisation history of the processor over a given time interval (window) to estimate the
appropriate future target speed of the processor. The principal objective of any
performance-setting policy is to maximise the busy time of the processor in the period
from the start of execution through to the task deadline by reducing the processor
frequency and voltage levels to an appropriate target performance level.
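The idea of deriving a target speed from a window of utilisation history can be sketched as follows. This is a generic interval-based heuristic written for illustration. It is not one of the algorithms of the policy stack 510, and the class name, window length and averaging rule are all assumptions.

```python
from collections import deque


class WindowedUtilisationPolicy:
    """Generic interval-based heuristic (illustrative only)."""

    def __init__(self, window=4):
        # Each entry is the full-speed-equivalent utilisation of one interval.
        self.history = deque(maxlen=window)

    def record(self, busy_fraction, perf_level):
        # busy_fraction: share of the interval the processor was busy (0..1)
        # perf_level: speed during the interval as a fraction of peak (0..1)
        self.history.append(busy_fraction * perf_level)

    def target(self):
        # With no history available, stay at full speed to be safe.
        if not self.history:
            return 1.0
        return sum(self.history) / len(self.history)


policy = WindowedUtilisationPolicy(window=4)
for _ in range(4):
    policy.record(busy_fraction=0.5, perf_level=1.0)   # half idle at full speed
target = policy.target()
print(target)
```

A processor that is idle half of the time at full speed could, under the simple proportional assumption, do the same work at half speed with no idle time, which is the sense in which busy time is maximised.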

To enable the target performance level to be realistically predicted, the
intelligent energy manager 120 provides an abstraction for tracking the actual work
done by the processor during a given time interval. This work-done abstraction
allows performance changes and idle time to be taken into account regardless of the
specific hardware counter implementations, which can vary between platforms.
According to the present technique, to obtain a work measurement estimate over a
time interval, each performance-setting algorithm is allocated a 'work structure' data
structure. Each algorithm is set up to call a 'work-start function' at the beginning of
the time interval and a 'work-stop function' at the end of the given time interval.
During the work-done measurement, the contents of the work structure are
automatically updated to specify the proportion of idle time and the proportion of
utilised processor time weighted by the respective performance levels of the
processor. The information stored in the work structure is then used to compute the
full-speed equivalent work value (Work_fse), which is subsequently used for target
performance-level prediction. This work-done abstraction functionality, which is
implemented in software in the intelligent energy manager 120, provides
performance-level prediction algorithm developers with a convenient interface to the
intelligent energy manager 120. The work-done abstraction also simplifies porting of the
performance-setting system of the present technique to different hardware
architectures.
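The work structure and its start and stop functions might look like the following sketch. The field names, the `work_account` helper and the use of seconds as the unit are assumptions for illustration. The text above only states that idle time and performance-weighted busy time are recorded between the two calls.

```python
from dataclasses import dataclass


@dataclass
class WorkStruct:
    start: float = 0.0           # time at which measurement began
    stop: float = 0.0            # time at which measurement ended
    idle: float = 0.0            # seconds spent idle
    weighted_busy: float = 0.0   # busy seconds weighted by performance level


def work_start(ws, now):
    """Begin a measurement interval and clear previous contents."""
    ws.start = now
    ws.stop = now
    ws.idle = 0.0
    ws.weighted_busy = 0.0


def work_account(ws, duration, perf_level, idle=False):
    """Hypothetical hook called for each slice of the interval."""
    if idle:
        ws.idle += duration
    else:
        ws.weighted_busy += duration * perf_level


def work_stop(ws, now):
    """End the interval and return the full-speed equivalent work."""
    ws.stop = now
    return ws.weighted_busy


ws = WorkStruct()
work_start(ws, now=0.0)
work_account(ws, 0.25, perf_level=1.0)             # busy at full speed
work_account(ws, 0.5, perf_level=0.5)              # busy at half speed
work_account(ws, 0.25, perf_level=0.5, idle=True)  # idle slice
work_fse = work_stop(ws, now=1.0)
print(work_fse, ws.idle)
```

In this sketch a policy only sees the structure and the two calls, so the platform-specific counter handling can be hidden behind `work_account`.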

One significant difference between alternative hardware platforms is the
manner in which time is measured on the platform. In particular, some architectures
provide a low overhead method of cycle-counting via timestamp counters whereas
other architectures only provide the user with externally programmable timer
interrupts. However, even when timestamp counters are provided they do not
necessarily measure the same things. For example, a first category of hardware
platforms includes both current Intel [RTM] Pentium and ARM [RTM] processors. In
these processors the timestamp counters count CPU-cycles so that the count-rate
varies in accordance with the speed of the processor and the counter stops counting
when the processor enters into sleep mode. A second category of hardware platforms,
which includes the Crusoe [RTM] processor, has an implementation of the
timestamp counter that consistently counts the cycles at the peak rate of the processor
and continues to increment the count at the peak rate even when the processor is in
sleep mode. The work-done abstraction facilitates implementation of the present
target performance-setting technique on both of these two alternative categories of
hardware platform.



The work estimate Work_fse as calculated in this embodiment does not take
account of the fact that a given workload running at half of peak performance does not
necessarily take twice as long to run to completion as it would at the full processor
speed. One reason for this counter-intuitive result is that although the processor core
is slowed down, the memory system is not. As a result, the core to memory
performance ratio improves in the memory's favour.
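The effect can be illustrated with a simple two-part run-time model in which only the core portion scales with clock frequency. The model and all numbers below are hypothetical. They are chosen to make the arithmetic easy and are not measurements.

```python
def run_time(core_cycles, mem_seconds, freq_hz):
    """Run time under a two-part model: core work scales, memory time does not."""
    return core_cycles / freq_hz + mem_seconds


PEAK_HZ = 600e6        # peak clock of the modelled processor
core_cycles = 300e6    # hypothetical: 0.5 s of core work at peak speed
mem_seconds = 0.5      # hypothetical: memory stall time, independent of core speed

t_full = run_time(core_cycles, mem_seconds, PEAK_HZ)
t_half = run_time(core_cycles, mem_seconds, PEAK_HZ / 2)
slowdown = t_half / t_full
print(t_full, t_half, slowdown)
```

Under these assumptions, halving the clock lengthens the run by less than a factor of two, so an estimate that treats work-done as directly proportional to processor speed overstates the cost of slowing down a memory-bound workload.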

Simulations were performed to evaluate the present performance-setting
technique against a known technique. In particular, the known technique is the
'LongRun' power manager that is built into the firmware of the Transmeta Crusoe
processor. LongRun is different from other known power management
techniques in that it avoids the need to modify the operating system in order to effect
the power management. LongRun uses the historical utilisation of the processor to
guide clock rate selection: it speeds up the processor if utilisation is high and
decreases performance if utilisation is low. Unlike on more conventional processors,
the power management policy can be implemented on the Crusoe processor relatively
easily because the processor already has a hidden software layer that performs
dynamic binary translation and optimisations. The simulations aimed to establish
how effectively a policy such as LongRun that is implemented at such a low level in
the software hierarchy can perform. The present technique was run alongside
LongRun on the same processor.



The simulations were performed on a Sony Vaio [RTM] PCG-C1VN
notebook computer using the Transmeta Crusoe 5600 processor running at a number
of fixed performance levels ranging from 300 MHz to 600 MHz in 100 MHz
performance-level steps. The simulations used a Mandrake 7.2 operating system with
a modified version of the Linux 2.4.4-ac18 kernel. The workloads used in the
comparative evaluation were as follows: Plaympeg SDL MPEG player library;
Acrobat Reader for rendering PDF files; Emacs for text editing; Netscape Mail and
News 4.7 for news reading; Konqueror 1.9.8 for web browsing; and Xwelltris 1.0.0 as
a 3D game. The benchmark used for interactive shell commands was a record of a
user performing miscellaneous shell operations during a span of about 30 minutes. To
avoid possible variability due to the dynamic translation engine of the Crusoe
processor, most benchmarks were run at least twice to warm up the dynamic
translation cache; simulation data from all but the last run was disregarded.

The performance-setting algorithm according to the present technique has
been designed so that it is unobtrusive to its host platform in the way timers are
handled. For the purpose of the simulations the present technique provided a
sub-millisecond resolution timer, without changing the way in which the Linux built-in
10 ms resolution timer worked. This was accomplished by piggybacking a timer
dispatch routine (which checks for timer events) onto often executed parts of the
kernel, such as the scheduler and system calls.

Since the performance-setting algorithm according to the present technique is
designed such that it has hooks to the kernel that allow it to intercept certain system
calls to find interactive episodes and it is invoked on every task switch, it was
straightforward to add a few instructions to these hooks to manage timer dispatches.
Each hook was augmented by implementing a read of the timestamp counter, a
comparison against the time stamp of the next timer event and a branch to the timer
dispatch routine upon success. In practice it was found that this strategy yielded a
timer with sub-millisecond accuracy.
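The read, compare and branch sequence added to each hook can be modelled in a few lines. The sketch below is a user-space illustration only: the class name, the heap-based event queue and the `read_timestamp` callable standing in for a timestamp counter read are all assumptions.

```python
import heapq


class SoftTimers:
    """Piggybacked timer check, modelled outside the kernel (illustrative)."""

    def __init__(self, read_timestamp):
        self.read_timestamp = read_timestamp   # stand-in for a timestamp counter read
        self.events = []                       # min-heap of (deadline, seq, handler)
        self._seq = 0                          # tie-breaker so handlers are never compared

    def add(self, deadline, handler):
        heapq.heappush(self.events, (deadline, self._seq, handler))
        self._seq += 1

    def check(self):
        # Cheap path placed in frequently executed hooks:
        # one timestamp read, one comparison, and a branch only when due.
        now = self.read_timestamp()
        if self.events and self.events[0][0] <= now:
            self.dispatch(now)

    def dispatch(self, now):
        # Run every handler whose deadline has passed.
        while self.events and self.events[0][0] <= now:
            _, _, handler = heapq.heappop(self.events)
            handler(now)


clock = [0]                 # mutable stand-in for the current timestamp
fired = []                  # records the times at which handlers ran
timers = SoftTimers(lambda: clock[0])
timers.add(100, fired.append)

clock[0] = 50
timers.check()              # nothing is due yet
clock[0] = 120
timers.check()              # the event with deadline 100 is dispatched
print(fired)
```

The handler runs at the first check after its deadline, so the achievable accuracy depends on how often the hooks execute.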

Table 2 below details the timer statistics pertaining to the simulations. The
worst-case timer resolution was bounded by the 10 millisecond (seems to be
inconsistent with Table 2) time quantum of the scheduler. However, since the events
that the performance-setting algorithm according to the present technique is interested
in measuring usually occur close to the timer triggers, the achieved resolution was
considered to be adequate. It proved to be advantageous that the soft-timers of the
system stopped ticking when the processor was in sleep mode since this meant that
the timer interrupts did not change the sleep characteristics of the running operating
system and program applications. The timers used had high resolution but low
overhead.

These advantageous features of the timers facilitated development of an
implementation having both an active mode and a passive mode. In the active mode
the performance-setting algorithm according to the present technique was in control.
In the passive mode the built-in LongRun power manager was in charge of
performance although the intelligent energy manager of the present technique acted as
an observer of the execution and performance changes.

Table 2

Cost of an access to a timestamp counter           30 to 40 cycles
Mean interval between timer checks                 ~0.1 milliseconds
Timer accuracy                                     1 millisecond
Average timer check and dispatch duration          100 to 150 cycles
(including possible execution of an event
handler)

Monitoring the performance changes caused by LongRun was accomplished similarly
to the timer dispatch routine. The intelligent energy manager 120 according to the
present technique periodically read the performance level of the processor through a
machine-specific register and compared the result to a previous value. If the two
values were different, then the change was logged in a buffer. The intelligent energy
manager according to the present technique includes a tracing mechanism that retains
a log of significant events in a kernel buffer. This log includes performance-level
requests from the different policies, task pre-emptions, task IDs (identifiers), and the
performance levels of the processor. In performing the simulations it was possible to
compare LongRun and the performance-setting algorithm according to the present
technique during the same execution run: LongRun was in control of
performance-setting while the intelligent energy manager 120 of the present technique was
operable to output the decisions that it would have made on the same workload had it
been in control. This simulation strategy was used to compare the known LongRun
technique and the present technique objectively on interactive benchmarks, whose
runs are not repeatable.
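The passive observation of performance-level changes amounts to polling a value and logging only the differences. A minimal sketch follows. The `read_level` callable stands in for the read of the machine-specific register, and the list stands in for the kernel trace buffer. Both are assumptions for illustration.

```python
class PassiveMonitor:
    """Logs performance-level changes made by another controller (illustrative)."""

    def __init__(self, read_level):
        self.read_level = read_level   # stand-in for a machine-specific register read
        self.previous = None           # no level has been observed yet
        self.log = []                  # stand-in for the kernel trace buffer

    def poll(self, timestamp):
        level = self.read_level()
        if level != self.previous:
            # Record the time, the old level and the new level.
            self.log.append((timestamp, self.previous, level))
            self.previous = level


# Hypothetical sequence of levels (in MHz) chosen by the controlling policy.
levels = iter([600, 600, 300, 300, 500])
monitor = PassiveMonitor(lambda: next(levels))
for t in range(5):
    monitor.poll(t)
print(monitor.log)
```

In this sketch the first poll is also logged, as a change from "unknown", so that the trace always begins with a known level.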



In order to assess the overhead of using the measurement and
performance-setting techniques, the performance-setting algorithm according to the present
technique was instrumented with markers that kept track of the time spent in the
performance-setting algorithm code at run-time. Although the run-time overhead of
the present technique on a Pentium II was found to be around 0.1% to 0.5%, on the
Transmeta Crusoe processor the overhead was between 1% and 4%. Further
measurements in virtual machines such as 'VMWare' and 'user-mode-linux' (UML)
confirmed that the overhead of the performance-setting algorithms according to the
present technique can be significantly higher in virtual machines than on traditional
processor architectures. However, this overhead could be effectively reduced by
algorithm optimisation.


MPEG (Motion Pictures Expert Group) video playback posed a difficult
challenge for all of the tested performance-setting algorithms. Although MPEG
playback typically puts a periodic load on the system, the
performance requirements can vary depending on the MPEG frame-type. As a
consequence, if a performance-setting algorithm uses a comparatively long
time-window corresponding to past (highly variable) MPEG frame-decoding events to
predict future performance requirements, it can miss the execution deadlines for (less
representative) more computationally intensive frames. On the other hand, if the
algorithm looks at only a short interval, then it will not converge to a single
performance value but oscillate rapidly between multiple settings. Since each change
in performance-level incurs a transition delay, rapid oscillation between different
performance-levels is undesirable. The simulation results for LongRun confirm this
oscillatory behaviour for the MPEG benchmark.

The present technique deals with this problem of oscillation for the MPEG
workload by relying on the interactive performance-setting algorithm at the top level
of the hierarchy to bound worst-case responsiveness. The more conventional
interval-based perspectives algorithm at the bottom level of the hierarchy is thus able to take a
longer-term view of performance-level requirements.

Figure 8 is a table that details simulation measurement results for the
'plaympeg' video player (http://www.lokigames.com/development/smpeg.php3)
playing a variety of MPEG videos. Some of the internal variables of the video player
have been exposed to provide information about how the player is affected as the
result of dynamically changing the processor performance-level during execution.
These figures are shown in the MPEG decode column of the table. In particular, the
'Ahead' variable measures how close to the deadline each frame decoding comes.
The closeness to the deadline is expressed as cumulative seconds during the playback
of each video. For maximum power efficiency, the Ahead variable value should be as
close to zero as possible, although the slowest performance level of the processor puts
a lower limit on how much the Ahead value can be reduced. An 'Exactly on time' field
in the right-most column of the table specifies the total number of frames that met
their deadlines exactly. The more frames that are exactly on time, the closer the
performance-setting algorithm is to the theoretical optimum. The data in the
Execution Statistics column of the table of Figure 8 was collected by the intelligent
energy manager 120 monitoring sub-system. To collect information about LongRun,
the intelligent energy manager 120 was used in passive mode to gather a trace of
performance changes without controlling the processor performance level. The Idle
field specifies the fraction of time spent in the idle loop of the kernel (possibly doing
housekeeping chores or just spinning) whereas the Sleep field specifies the fraction of
time that the processor actually spends in a low-power sleep mode. It can be seen
from the table in Figure 8 that for each of these performance measures the present
technique performs considerably better than LongRun.

Figure 9 is a table that lists processor performance level statistics collected
during the runs of each workload. The fraction of time at each performance level is
computed as a proportion of the total non-idle time during the run of the workload.
The 'Mean perf-level' column of the table specifies the average performance levels
(as the percentage of peak performance) during the execution of each workload.
Since, in all cases, the mean performance level for each workload was lower using the
present technique than for LongRun, the last column specifies the mean performance
reduction achieved with regard to LongRun. The playback quality for both the
LongRun workload and the workload of the present technique was the same, i.e.
identical frame rates and no dropped frames.
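The per-level fractions and the mean performance level described above can be computed from a trace of (level, duration) pairs, as in the following sketch. The trace format, the use of `None` to mark idle time and the sample numbers are assumptions made for illustration and are not data from Figure 9.

```python
PEAK_MHZ = 600  # peak performance level of the test processor


def level_statistics(trace, peak_mhz=PEAK_MHZ):
    """Return (fraction of non-idle time per level, mean level as % of peak).

    trace is a list of (level_mhz, seconds); level_mhz is None for idle time.
    """
    busy = {}
    for level_mhz, seconds in trace:
        if level_mhz is None:        # idle time is excluded from the statistics
            continue
        busy[level_mhz] = busy.get(level_mhz, 0.0) + seconds
    total = sum(busy.values())
    if total == 0:
        return {}, 0.0
    fractions = {level: t / total for level, t in busy.items()}
    mean_pct = 100.0 * sum(level * f for level, f in fractions.items()) / peak_mhz
    return fractions, mean_pct


# Hypothetical trace: 1 s at 600 MHz, 2 s idle, 3 s at 300 MHz.
trace = [(600, 1.0), (None, 2.0), (300, 3.0)]
fractions, mean_pct = level_statistics(trace)
print(fractions, mean_pct)
```

Excluding idle time keeps the statistics comparable between runs that spend different amounts of time waiting.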

The results show that the present technique is able to predict the necessary
performance level more accurately than the known LongRun technique. The increased
accuracy results in an 11% to 35% reduction of the average performance levels of the
processor during execution of the benchmarks. Since the amount of work between
runs of a workload should stay the same, the lower average performance level implied
that reduced idle and sleep times could be expected when the intelligent energy
manager of the present technique is enabled. This expectation was affirmed by the
simulation results. Similarly, the number of frames that exactly meet their deadlines
increases when the intelligent energy manager of the present technique is enabled and
the cumulative amount of time when decode is ahead of its deadline is reduced.

The median performance level (highlighted with bold in each column of the
table of Figure 9) also shows significant reductions. Whereas on most benchmarks
the performance-setting algorithm according to the present technique settles on a
single performance level below peak for the greatest fraction of execution time
(>88%), LongRun usually sets the processor to run at full-speed. The exception to
this general rule is the 'Danse De Cable' workload, where the performance-setting
algorithm according to the present technique settles on the lowest two performance
levels and oscillates between these two levels. This oscillatory
behaviour is due to the specific performance levels on the Crusoe processor. The
performance-setting algorithm according to the present technique would have elected
to select a performance level only slightly higher than 300 MHz, so that as the
performance-level prediction fluctuated above and below the 300 MHz value, the
target performance-level was quantized to the closest two performance levels. The
most notable difference in performance between the known LongRun technique and
the present technique is that LongRun appears to be over-cautious in that it ramps up
the performance level very quickly when it detects significant amounts of processor
activity.
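The oscillation between the two lowest levels can be reproduced with a simple quantisation rule. The sketch below assumes that a request is rounded up to the lowest available level that satisfies it. That rule, the function name and the sample predictions are assumptions for illustration and are not taken from the description.

```python
import bisect

LEVELS_MHZ = [300, 400, 500, 600]  # fixed levels available on the test processor


def quantise(target_mhz, levels=LEVELS_MHZ):
    """Round a requested speed up to the lowest level that satisfies it."""
    index = bisect.bisect_left(levels, target_mhz)
    # Requests above the top level are clamped to the maximum.
    return levels[min(index, len(levels) - 1)]


# Hypothetical predictions fluctuating just above and below 300 MHz.
predictions = [295, 310, 298, 305]
selected = [quantise(p) for p in predictions]
print(selected)
```

A prediction that hovers around a level boundary therefore alternates between the two adjacent levels even though the underlying demand is almost constant.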

Over all workloads, the average processor performance level with LongRun
never fell below 80%, whilst the performance level set by the present technique fell to
as low as 52% for the 'Red's Nightmare small' benchmark. The algorithm according
to the present technique is more aggressive than LongRun but responds quickly when
the quality of service appears to have been compromised. Since LongRun does not
have any information about the interactive performance, it is forced to act
conservatively on a shorter time frame and the simulation results show that this leads
to inefficiencies.

Figure 10 comprises two graphs of results for playback of two different
MPEG movies entitled 'Legendary' (Figure 10A) and 'Danse de Cable' (Figure
10B). Each graph illustrates the fraction of time spent at each of four processor
performance levels (300, 400, 500, and 600 MHz) for both LongRun and the present
technique. Although the playback quality for each run was identical, it can be seen
from the graphs that use of the algorithm according to the present technique meant
that the processor spent significantly longer at below peak performance than it did
when the LongRun technique specified the performance level. The results for
playback of the 'Legendary' movie plotted in Figure 10A show that the algorithm
according to the present technique settles on a performance level of 500 MHz. The
results for the 'Danse de Cable' movie shown in Figure 10B reveal that using the
algorithm according to the present technique, the processor switched between two
performance-levels, i.e. 300 MHz and 400 MHz. By way of contrast, for both of these
movies the LongRun performance setting algorithm chose the peak processor speed
of 600 MHz for a dominant portion of the execution time.

Figure 11 provides qualitative insight into the characteristics of the two
different performance-setting policies. LongRun keeps switching the performance
level up and down in fast succession, while the processor performance-level of the
system when controlled according to the present technique stays close to a target
performance level. The two graphs of Figure 11A (top row) show the performance
levels of the processor during a benchmark run with LongRun enabled. Figures 11B
and 11C (middle and bottom rows) show performance-level results for the same
benchmark but with the algorithm of the present technique enabled. Figure 11B
shows the actual performance levels during execution, while Figure 11C reflects the
performance level that the performance-setting algorithm according to the present
technique would request on a processor that could run at arbitrary performance levels
(given the same maximum performance). Note that in some cases, the desired performance
levels calculated by the algorithm according to the present technique are actually
below the minimum achievable performance-level on the processor.

Now consider simulation results for comparison of the two techniques on
interactive workloads. Due to the difficulty in making interactive benchmark runs
repeatable, interactive workloads are significantly harder to evaluate than the
multimedia benchmarks. To circumvent this problem, empirical measurements were
combined with a simple simulation technique. More specifically, the interactive
benchmarks were run under the control of the native LongRun power manager and the
intelligent energy manager 120 according to the present technique was only engaged
in passive mode, so that it merely recorded the performance-setting decisions that it
would have made but did not actually change the performance levels of the processor.

Figure 12 shows the performance data that was collected during a simulation
run for assessment of interactive workloads. Figure 12A is a graph of percentage
performance level against time (in seconds) for the LongRun technique and in this
case the plotted results correspond to the actual performance levels of the processor
during the measurement. Figure 12B is a plot of the quantized performance levels
whereas Figure 12C is a plot of the raw performance levels as a function of time that
the performance-setting algorithm of the present technique would have set, had it been
in control of the processor. Note that if the algorithm of the present technique had in
fact been in control, its performance-setting decisions would have had a different
run-time impact from those made by LongRun. For this reason the time axes on the
graphs of Figures 12B and 12C should be regarded as approximations.

To get around the time-skew problem in the statistics, the passive
performance-level traces of the simulations according to the present technique were
post-processed to assess the impact of the increased execution times that would have
resulted from the use of the present technique instead of LongRun. Rather than
looking at the entire performance-level trace, only the interactive episodes were
focussed on. The interactive performance-setting algorithm of the present technique
includes functionality for finding durations of execution that have a direct impact on
the user. This technique gives valid readings regardless of which algorithm is in
control and was thus used to focus our measurements. Once the execution range for
an interactive episode had been isolated, the full-speed equivalent work done during
the episode was computed for both LongRun and the present technique. Since during
the measurement LongRun is in control of the CPU speed and it runs faster than it
would do if the present technique were in control, the episode duration of results
corresponding to the present technique must be lengthened. First, the remaining work
is computed for the present technique according to the following formula:

Work_{present technique, remaining} = Work_{LongRun} - Work_{present technique}

Next, the algorithm computed to what extent the length of the interactive
episode needed to be stretched, assuming that the algorithm of the present technique
continued to run at its predicted speed until it reached the panic threshold, and ran at
full-speed after that. The statistics were adjusted accordingly. It was found that the
results using this technique were close to what we observed on similar workloads
(same benchmark but with a slightly different interactive load) running with the
algorithm according to the present technique in active control of the processor.
However, when the algorithm according to the present technique was actually in
control, the number of performance-setting decisions was reduced and the
performance-levels were more accurate.
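The time-skew correction can be sketched as a two-phase calculation: the remaining work is first executed at the predicted level until the panic threshold is reached, and any work still outstanding is then executed at full speed. The function below is an illustration under stated assumptions. Work is expressed in full-speed-equivalent time units, the panic threshold is measured from the start of the episode, and all names and sample values are hypothetical.

```python
def stretched_duration(duration, work_longrun, work_present,
                       predicted_level, panic_threshold):
    """Estimate the episode length had the passive policy been in control.

    duration: observed episode length under LongRun
    work_longrun, work_present: full-speed equivalent work in the episode
    predicted_level: predicted speed as a fraction of peak, in (0, 1]
    panic_threshold: elapsed time after which full speed is used
    """
    if not 0.0 < predicted_level <= 1.0:
        raise ValueError("predicted_level must be in (0, 1]")
    remaining = max(0.0, work_longrun - work_present)
    elapsed = duration
    # Phase 1: continue at the predicted level until the panic threshold.
    if remaining > 0.0 and elapsed < panic_threshold:
        slot = panic_threshold - elapsed
        done = min(remaining, slot * predicted_level)
        elapsed += done / predicted_level
        remaining -= done
    # Phase 2: any work still outstanding runs at full speed.
    elapsed += remaining
    return elapsed


stretched = stretched_duration(duration=4.0, work_longrun=4.0, work_present=2.0,
                               predicted_level=0.5, panic_threshold=5.0)
print(stretched)
```

With these sample values the episode is lengthened because half of the work is still outstanding when the observed LongRun episode ends.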



Figure 13 shows the statistics gathered using the above-described time-skew
correction technique. Each of the six graphs in the figure comprises two stacked
columns. The left-hand column on each graph relates to LongRun whereas the
right-hand column relates to the present technique. Each column is stacked so as to
represent the fraction of time spent in interactive episodes at each of the four
performance levels supported in the computer. These performance levels (from
bottom up) are from 300 MHz to 600 MHz in 100 MHz increments. Even from a high
level, it is apparent that the algorithm according to the present technique spends more
time at lower performance levels than LongRun does. On some benchmarks such as
Emacs, there is hardly ever a need to go fast and the interactive deadlines are met
while the machine stays at its lowest possible performance level. At the other end of
the spectrum is the Acrobat Reader benchmark, which exhibits bimodal behaviour:
the processor either runs at its peak level or at its minimum. Even on this benchmark
many of the interactive episodes can complete in time at the minimum performance
level of the processor. However, when it comes to rendering the pages, the peak
performance level of the processor is not sufficient to complete its deadlines within
the user perception threshold. Thus, upon encountering a sufficiently long interactive
episode, the algorithm according to the present technique switches the processor
performance-level to its peak. By way of contrast, during the run of the Konqueror
benchmark, the algorithm according to the present technique can take advantage of
all four available performance levels of the processor. This can be compared with the
LongRun strategy, which causes the processor to spend most of its time at the peak
level.



Overall, the simulation results detailed above with reference to Figures 8 to
13 have shown how two performance-setting policies implemented at different levels
in the software hierarchy behave on a variety of multimedia and interactive
workloads. It was found that the Transmeta LongRun power manager, which is
implemented in the processor's firmware, makes more conservative choices than the
algorithm according to the present technique, which is implemented in the kernel of
the operating system. On a set of multi-media benchmarks an 11% to 35% average
performance level reduction was achieved by the algorithm according to the present
technique over that achieved using the known LongRun technique.

Since the performance-setting algorithm according to the present technique is 
implemented higher in the software stack than LongRun it is able to make decisions 
based on a richer set of run-time information, which in turn translates into increased 
accuracy. 

Although the firmware approach of LongRun was shown to be less accurate than an 
algorithm implemented in the kernel, this does not diminish its usefulness. LongRun has 
the crucial advantage of being operating system agnostic. It is recognised that the gap 
between low and high level implementations could be bridged by providing a 
baseline performance-setting algorithm such as LongRun in firmware and exposing an 
interface to the operating system for the purpose of (optionally) refining processor 
performance-setting decisions. The hierarchy of performance-setting algorithms 
according to the present technique provides a mechanism to support such a design. The 
bottom-most performance-setting policy on the stack could actually be implemented 
in the firmware of the processor. 
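
By way of illustration only, the following minimal sketch shows one way in which a 
firmware baseline policy and an operating system policy might be combined through 
such a hierarchy. The names Command, Request, combine, select_level and LEVELS, 
the three command values, the default of peak performance and the numeric level 
values are all assumptions made for this example and are not taken from any 
particular implementation. Requests are evaluated from the least dominant policy to 
the most dominant policy, and the lowest available performance level that satisfies 
the resulting global request is then selected. 

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Command(Enum):
    # Hypothetical command names, chosen for this sketch only.
    SET = "set"                        # override requests from less dominant policies
    SET_IF_GREATER = "set_if_greater"  # win only over lower requests from below
    IGNORE = "ignore"                  # take no part in the combination


@dataclass
class Request:
    command: Command
    level: float  # requested performance as a fraction of peak, 0.0 to 1.0


def combine(requests: List[Request], default: float = 1.0) -> float:
    """Combine requests ordered from least dominant to most dominant policy."""
    result: Optional[float] = None
    for req in requests:
        if req.command is Command.IGNORE:
            continue
        if req.command is Command.SET or result is None:
            result = req.level
        else:  # SET_IF_GREATER: keep the higher of the two requests
            result = max(result, req.level)
    # Assumption: fall back to peak performance when no policy expresses a view.
    return default if result is None else result


def select_level(global_request: float, available: List[float]) -> float:
    """Pick the lowest available level that satisfies the global request."""
    candidates = [lvl for lvl in sorted(available) if lvl >= global_request]
    return candidates[0] if candidates else max(available)


# Assumed set of four discrete performance levels (fractions of peak).
LEVELS = [0.3, 0.5, 0.75, 1.0]

# Firmware baseline (least dominant) conservatively asks for peak performance;
# the kernel policy (more dominant) refines the decision down to half speed.
stack = [
    Request(Command.SET_IF_GREATER, 1.0),  # firmware baseline policy
    Request(Command.SET, 0.5),             # operating system kernel policy
]
chosen = select_level(combine(stack), LEVELS)
```

In this sketch the kernel request overrides the firmware baseline because it occupies 
the more dominant position and is issued with the overriding command. If the kernel 
policy instead issued the ignore command, the firmware baseline alone would 
determine the selected level. 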

CLAIMS 



1. A method of selecting a performance level to be used by a data processing 
apparatus capable of operating at a plurality of different performance levels, said 
method comprising the steps of: 

calculating a plurality of performance requests using respective ones of a 
plurality of performance request calculating algorithms; 

combining said plurality of performance requests to form a global 
performance request; and 

selecting said performance level to be used by said data processing apparatus 
from among said plurality of different performance levels in dependence upon said 
global performance request. 

2. A method as claimed in claim 1, when at least one of said plurality of 
performance request calculating algorithms calculates a performance request 
independently of other of said plurality of performance request calculating algorithms. 

3. A method as claimed in any one of claims 1 and 2, wherein at least one of said 
plurality of performance request calculating algorithms calculates a performance 
request based upon detected operating characteristics of said data processing 
apparatus. 

4. A method as claimed in any one of the preceding claims, wherein said step of 
selecting is temporally independent of said steps of calculating and combining. 

5. A method as claimed in any one of the preceding claims, wherein said 
plurality of performance request calculating algorithms are temporally independent of 
one another. 



6. A method as claimed in any one of the preceding claims, wherein said 
plurality of performance request calculating algorithms are associated with a 
hierarchy of performance request calculating algorithms, a performance request from 
a performance request calculating algorithm being combined with other performance 
requests in dependence upon a position of said performance request calculating algorithm 
within said hierarchy of performance request calculating algorithms. 

7. A method as claimed in claim 6, wherein said hierarchy of performance 
request calculating algorithms is fully ordered. 

8. A method as claimed in claim 6, wherein said hierarchy of performance 
request calculating algorithms is partially ordered and an operator is provided for 
combining performance requests from performance request calculating algorithms on 
a hierarchy common level. 

9. A method as claimed in claim 8, wherein said operator is a maximum value 
selector. 

10. A method as claimed in claim 6, wherein a first priority performance request 
from a first priority performance request calculating algorithm can override a second 
priority performance request from a second priority performance request calculating 
algorithm when said first performance request calculating algorithm has a more 
dominant position within said hierarchy of performance request calculating algorithms 
than said second performance request calculating algorithm. 

11. A method as claimed in any one of the preceding claims, wherein at 
least one of said plurality of performance request calculating algorithms generates a 
command accompanying a performance request specifying how that performance 
request should be combined with other performance requests. 

12. A method as claimed in claims 6 and 11, wherein said command specifies that 
said performance request should override any performance request from a 
performance calculating algorithm with a less dominant position in said hierarchy of 
performance request calculating algorithms. 

13. A method as claimed in claims 6 and 11, wherein said command specifies that 
said performance request should be selected in preference to any lower performance 
level performance request from a performance calculating algorithm in a less 
dominant position in said hierarchy of performance request calculating algorithms. 

14. A method as claimed in claim 11, wherein said command specifies that said 
performance request should be ignored irrespective of any other performance level 
requests. 

15. A method as claimed in claim 6, wherein said performance requests are 
combined starting from a performance request from a performance request calculating 
algorithm having a least dominant position within said hierarchy of performance 
request calculating algorithms working through to a performance request from a 
performance request calculating algorithm having a most dominant position within 
said hierarchy of performance request calculating algorithms. 

16. A method as claimed in any one of the preceding claims, wherein at least some 
of said steps calculating, combining and selecting are performed by one or more of: 

an operating system kernel; 

firmware of said data processing apparatus; 

hardware within said data processing apparatus. 

17. A method as claimed in any one of the preceding claims, wherein at least one 
of said plurality of performance request calculating algorithms is responsive to 
deadline information from a real time operating system kernel. 

18. A method as claimed in any one of the preceding claims, wherein at least one 
of said plurality of performance request calculating algorithms is responsive to 
information from an operating system kernel. 

19. A method as claimed in any one of the preceding claims, wherein at least one 
of said plurality of performance request calculating algorithms is responsive to 
information from an application program, a device or a device driver. 



20. A method as claimed in claim 19, wherein said information is indicative of a 
change in operating conditions and said at least one performance request calculating 
algorithm is operable to recalculate a respective one of said plurality of performance 
requests in response to receipt of said information. 

21. Apparatus for selecting a performance level to be used by a data processing 
apparatus capable of operating at a plurality of different performance levels, said 
apparatus comprising: 

calculating logic operable to calculate a plurality of performance requests 
using respective ones of a plurality of performance request calculating algorithms; 

combining logic operable to combine said plurality of performance requests to 
form a global performance request; and 

selecting logic operable to select said performance level to be used by said 
data processing apparatus from among said plurality of different performance levels in 
dependence upon said global performance request. 

22. Apparatus as claimed in claim 21, when at least one of said plurality of 
performance request calculating algorithms calculates a performance request 
independently of other of said plurality of performance request calculating algorithms. 

23. Apparatus as claimed in any one of claims 21 and 22, wherein at least one of 
said plurality of performance request calculating algorithms calculates a performance 
request based upon detected operating characteristics of said data processing 
apparatus. 

24. Apparatus as claimed in any one of claims 21, 22 and 23, wherein said step of 
selecting is temporally independent of said steps of calculating and combining. 

25. Apparatus as claimed in any one of claims 21 to 24, wherein said plurality of 
performance request calculating algorithms are temporally independent of one 
another. 

26. Apparatus as claimed in any one of claims 21 to 25, wherein said plurality of 
performance request calculating algorithms are associated with a hierarchy of 
performance request calculating algorithms, a performance request from a 
performance request calculating algorithm being combined with other performance 
requests in dependence upon a position of said performance request calculating algorithm 
within said hierarchy of performance request calculating algorithms. 

27. Apparatus as claimed in claim 26, wherein said hierarchy of performance 
request calculating algorithms is fully ordered. 

28. Apparatus as claimed in claim 26, wherein said hierarchy of performance 
request calculating algorithms is partially ordered and an operator is provided for 
combining performance requests from performance request calculating algorithms on 
a hierarchy common level. 

29. Apparatus as claimed in claim 28, wherein said operator is a maximum value 
selector. 

30. Apparatus as claimed in claim 26, wherein a first priority performance request 
from a first priority performance request calculating algorithm can override a second 
priority performance request from a second priority performance request calculating 
algorithm when said first performance request calculating algorithm has a more 
dominant position within said hierarchy of performance request calculating algorithms 
than said second performance request calculating algorithm. 

31. Apparatus as claimed in any one of claims 21 to 30, wherein at 
least one of said plurality of performance request calculating algorithms generates a 
command accompanying a performance request specifying how that performance 
request should be combined with other performance requests. 

32. Apparatus as claimed in claims 26 and 31, wherein said command specifies 
that said performance request should override any performance request from a 
performance calculating algorithm with a less dominant position in said hierarchy of 
performance request calculating algorithms. 

33. Apparatus as claimed in claims 26 and 31, wherein said command specifies 
that said performance request should be selected in preference to any lower 
performance level performance request from a performance calculating algorithm in a 
less dominant position in said hierarchy of performance request calculating 
algorithms. 

34. Apparatus as claimed in claim 31, wherein said command specifies that said 
performance request should be ignored irrespective of any other performance level 
requests. 

35. Apparatus as claimed in claim 26, wherein said performance requests are 
combined starting from a performance request from a performance request calculating 
algorithm having a least dominant position within said hierarchy of performance 
request calculating algorithms working through to a performance request from a 
performance request calculating algorithm having a most dominant position within 
said hierarchy of performance request calculating algorithms. 

36. Apparatus as claimed in any one of claims 21 to 35, wherein at least some of 
said steps calculating, combining and selecting are performed by one or more of: 

an operating system kernel; 

firmware of said data processing apparatus; 

hardware within said data processing apparatus. 

37. Apparatus as claimed in any one of claims 21 to 36, wherein at least one of 
said plurality of performance request calculating algorithms is responsive to deadline 
information from a real time operating system kernel. 

38. Apparatus as claimed in any one of claims 21 to 37, wherein at least one of 
said plurality of performance request calculating algorithms is responsive to 
information from an operating system kernel. 

39. Apparatus as claimed in any one of claims 21 to 38, wherein at least one of 
said plurality of performance request calculating algorithms is responsive to 
information from an application program, a device or a device driver. 

40. Apparatus as claimed in claim 39, wherein said information is indicative of a 
change in operating conditions and said at least one performance request calculating 
algorithm is operable to recalculate a respective one of said plurality of performance 
requests in response to receipt of said information. 

41. A computer program product bearing a computer program for controlling a 
computer to select a performance level to be used by said computer, said computer 
being capable of operating at a plurality of different performance levels, said 
computer program comprising: 

calculating code operable to calculate a plurality of performance requests 
using respective ones of a plurality of performance request calculating algorithms; 

combining code operable to combine said plurality of performance requests to 
form a global performance request; and 

selecting code operable to select said performance level to be used by said data 
processing apparatus from among said plurality of different performance levels in 
dependence upon said global performance request. 

42. A computer program product as claimed in claim 41, when at least one of said 
plurality of performance request calculating algorithms calculates a performance 
request independently of other of said plurality of performance request calculating 
algorithms. 

43. A computer program product as claimed in any one of claims 41 and 42, 
wherein at least one of said plurality of performance request calculating algorithms 
calculates a performance request based upon detected operating characteristics of said 
computer. 

44. A computer program product as claimed in any one of claims 41 to 43, 
wherein said step of selecting is temporally independent of said steps of calculating 
and combining. 

45. A computer program product as claimed in any one of claims 41 to 44, 
wherein said plurality of performance request calculating algorithms are temporally 
independent of one another. 

46. A computer program product as claimed in any one of claims 41 to 45, 
wherein said plurality of performance request calculating algorithms are associated 
with a hierarchy of performance request calculating algorithms, a performance request 
from a performance request calculating algorithm being combined with other 
performance requests in dependence upon a position of said performance request 
calculating algorithm within said hierarchy of performance request calculating 
algorithms. 

47. A computer program product as claimed in claim 46, wherein said hierarchy of 
performance request calculating algorithms is fully ordered. 

48. A computer program product as claimed in claim 46, wherein said hierarchy of 
performance request calculating algorithms is partially ordered and an operator is 
provided for combining performance requests from performance request calculating 
algorithms on a hierarchy common level. 



49. A computer program product as claimed in claim 48, wherein said operator is 
a maximum value selector. 



50. A computer program product as claimed in claim 46, wherein a first priority 
performance request from a first priority performance request calculating algorithm 
can override a second priority performance request from a second priority 
performance request calculating algorithm when said first performance request 
calculating algorithm has a more dominant position within said hierarchy of 
performance request calculating algorithms than said second performance request 
calculating algorithm. 

51. A computer program product as claimed in any one of claims 41 to 
50, wherein at least one of said plurality of performance request calculating 
algorithms generates a command accompanying a performance request specifying 
how that performance request should be combined with other performance requests. 

52. A computer program product as claimed in claims 46 and 51, wherein said 
command specifies that said performance request should override any performance 
request from a performance calculating algorithm with a less dominant position in 
said hierarchy of performance request calculating algorithms. 

53. A computer program product as claimed in claims 46 and 51, wherein said 
command specifies that said performance request should be selected in preference to 
any lower performance level performance request from a performance calculating 
algorithm in a less dominant position in said hierarchy of performance request 
calculating algorithms. 

54. A computer program product as claimed in claim 51, wherein said command 
specifies that said performance request should be ignored irrespective of any other 
performance level requests. 

55. A computer program product as claimed in claim 46, wherein said 
performance requests are combined starting from a performance request from a 
performance request calculating algorithm having a least dominant position within 
said hierarchy of performance request calculating algorithms working through to a 
performance request from a performance request calculating algorithm having a most 
dominant position within said hierarchy of performance request calculating 
algorithms. 

56. A computer program product as claimed in any one of claims 46 to 55, wherein 
at least some of said steps calculating, combining and selecting are performed by one 
or more of: 

an operating system kernel; 

firmware of said data processing apparatus; 

hardware within said data processing apparatus. 

57. A computer program product as claimed in any one of claims 41 to 56, 
wherein at least one of said plurality of performance request calculating algorithms is 
responsive to deadline information from a real time operating system kernel. 



58. A computer program product as claimed in any one of claims 41 to 57, 
wherein at least one of said plurality of performance request calculating algorithms is 
responsive to information from an operating system kernel. 



59. A computer program product as claimed in any one of claims 41 to 58, 
wherein at least one of said plurality of performance request calculating algorithms is 
responsive to information from an application program, a device or a device driver. 

60. A computer program product as claimed in claim 59, wherein said information 
is indicative of a change in operating conditions and said at least one performance 
request calculating algorithm is operable to recalculate a respective one of said 
plurality of performance requests in response to receipt of said information. 

ABSTRACT 

PERFORMANCE LEVEL SELECTION IN A DATA PROCESSING SYSTEM 

Performance level selection is carried out by calculating 
a plurality of performance requests using a plurality of performance request 
calculating algorithms, combining those different performance requests to form a 
global performance request and then selecting a performance level in dependence 
upon the global performance level request. The performance request calculating 
algorithms can be arranged in a hierarchy with their performance requests evaluated 
in a sequence starting from the least dominant position in the hierarchy and moving 
through to the most dominant position in the hierarchy. Commands may accompany 
each performance level request to specify how it should be combined with other 
performance level requests. 



[Figure 5] 